Abstract. We show that combinations of optimal (stationary) policies in unichain
Markov decision processes are optimal. That is, let $\mathcal{M}$ be a unichain Markov decision
process with state space $S$, action space $A$ and policies $\pi_j^\circ : S \to A$ ($1 \le j \le n$) with
optimal average infinite horizon reward. Then any combination $\pi$ of these policies, where
for each state $i \in S$ there is a $j$ such that $\pi(i) = \pi_j^\circ(i)$, is optimal as well. Furthermore,
we prove that any mixture of optimal policies, where at each visit in a state $i$ an arbitrary
action $\pi_j^\circ(i)$ of an optimal policy is chosen, yields optimal average reward, too.

1. Introduction

Definition 1.1. A Markov decision process (MDP) $\mathcal{M}$ on a (finite) set of states $S$ with
a (finite) set of actions $A$ available in each state $\in S$ consists of

(i) an initial distribution $\mu_0$ that specifies the probability of starting in some state in $S$,
(ii) the transition probabilities $p_a(i,j)$ that specify the probability of reaching state $j$
when choosing action $a$ in state $i$, and
(iii) the payoff distributions with mean $r_a(i)$ that specify the random reward for choosing
action $a$ in state $i$.

A (stationary) policy on $\mathcal{M}$ is a mapping $\pi : S \to A$.



Note that each policy $\pi$ induces a Markov chain on $\mathcal{M}$. We are interested in MDPs
where, in each of the induced Markov chains, any state is reachable from any other state.

Definition 1.2. An MDP $\mathcal{M}$ is called unichain if for each policy $\pi$ the Markov chain
induced by $\pi$ is ergodic, i.e. if the matrix $P = (p_{\pi(i)}(i,j))_{i,j \in S}$ is irreducible.

It is a well-known fact (cf. e.g. [1], p. 130ff) that for an ergodic Markov chain with
transition matrix $P$ there exists a unique invariant and strictly positive distribution $\mu$,
such that independent of the initial distribution $\mu_0$ one has $\mu_n = \bar{P}_n \mu_0 \to \mu$, where
$\bar{P}_n = \frac{1}{n} \sum_{j=1}^{n} P^j$.¹ Thus, given a policy $\pi$ on a unichain MDP that induces a Markov
chain with invariant distribution $\mu$, the average reward of that policy can be defined as

$$V(\pi) := \sum_{i \in S} \mu(i)\, r_{\pi(i)}(i).$$

¹ Actually, for aperiodic Markov chains one has even $P^n \mu_0 \to \mu$, while the convergence behavior of
periodic Markov chains can be described more precisely. However, for our purposes the stated fact is
sufficient.

A policy $\pi^\circ$ is called optimal if for all policies $\pi$: $V(\pi) \le V(\pi^\circ)$. It can be shown ([2],
p. 360ff) that in the unichain case the optimal value $V(\pi^\circ)$ cannot be increased by allowing
time-dependent policies, as there is always a stationary (time-independent) policy that
gains optimal average reward, which is why we consider only stationary policies.
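The definition of the average reward can be made concrete with a small computation. The following is a minimal sketch, assuming a made-up two-state chain induced by some policy; the helper names `stationary` and `average_reward` and all numbers are illustrative and not taken from the paper. It solves $\mu P = \mu$ exactly with rational arithmetic and then evaluates $V(\pi) = \sum_{i} \mu(i)\, r_{\pi(i)}(i)$.

```python
from fractions import Fraction as F


def stationary(P):
    """Invariant distribution mu (mu P = mu, sum(mu) = 1) of an irreducible
    transition matrix P, computed by exact Gauss-Jordan elimination."""
    n = len(P)
    # Rows 0..n-2: sum_j mu_j * (P[j][i] - [i == j]) = 0; last row: sum_j mu_j = 1.
    A = [[P[j][i] - (1 if i == j else 0) for j in range(n)] + [F(0)]
         for i in range(n - 1)]
    A.append([F(1)] * (n + 1))
    for col in range(n):
        piv = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[piv] = A[piv], A[col]
        A[col] = [x / A[col][col] for x in A[col]]
        for r in range(n):
            if r != col:
                factor = A[r][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]


def average_reward(rows, rewards):
    """V(pi) for the chain a policy induces: rows[i] is the transition row and
    rewards[i] the mean reward of the action chosen in state i."""
    mu = stationary(rows)
    return sum(m * r for m, r in zip(mu, rewards))


# Hypothetical two-state chain (illustrative numbers only).
rows = [[F(1, 2), F(1, 2)], [F(3, 4), F(1, 4)]]
rewards = [F(1, 2), F(7, 4)]
mu = stationary(rows)
V = average_reward(rows, rewards)
```

Exact fractions are used here only so that equalities between average rewards can be checked without rounding issues.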

In this setting we are going to prove that combinations of optimal policies are optimal 
as well. 

Theorem 1.1. Let $\mathcal{M}$ be a unichain MDP with state space $S$ and $\pi_1^\circ, \pi_2^\circ$ optimal policies
on $\mathcal{M}$. Then any combination $\pi$ of these policies where for each state $i \in S$ either
$\pi(i) = \pi_1^\circ(i)$ or $\pi(i) = \pi_2^\circ(i)$ is optimal as well.

Obviously, if two combined optimal policies are optimal, so are combinations of an 
arbitrary number of optimal policies. Thus, one immediately obtains that the set of 
optimal policies is closed under combination. 

Corollary 1.1. Let $\mathcal{M}$ be a unichain MDP with state space $S$. A policy $\pi$ is optimal on
$\mathcal{M}$ if and only if for each state $i$ there is an optimal policy $\pi^\circ$ with $\pi(i) = \pi^\circ(i)$.
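The statement of Theorem 1.1 and Corollary 1.1 can be checked by brute force on a small instance. The sketch below assumes a hypothetical two-state unichain MDP (all transition rows and rewards are invented for illustration, and action "c" in state 0 is deliberately suboptimal). It enumerates all pure policies, collects the optimal ones, and tests whether every combination of their actions is again optimal.

```python
from fractions import Fraction as F
from itertools import product


def stationary(P):
    """Invariant distribution of an irreducible chain (exact Gauss-Jordan)."""
    n = len(P)
    A = [[P[j][i] - (1 if i == j else 0) for j in range(n)] + [F(0)]
         for i in range(n - 1)]
    A.append([F(1)] * (n + 1))
    for col in range(n):
        piv = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[piv] = A[piv], A[col]
        A[col] = [x / A[col][col] for x in A[col]]
        for r in range(n):
            if r != col:
                factor = A[r][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]


# Hypothetical MDP: TRANS[i][a] is the transition row, REW[i][a] the mean reward.
TRANS = {
    0: {"a": [F(1, 2), F(1, 2)], "b": [F(1, 4), F(3, 4)], "c": [F(1, 2), F(1, 2)]},
    1: {"a": [F(1, 2), F(1, 2)], "b": [F(3, 4), F(1, 4)]},
}
REW = {0: {"a": F(1, 2), "b": F(1, 4), "c": F(0)},
       1: {"a": F(3, 2), "b": F(7, 4)}}
states = sorted(TRANS)


def average_reward(policy):
    mu = stationary([TRANS[i][policy[i]] for i in states])
    return sum(mu[i] * REW[i][policy[i]] for i in states)


policies = list(product(*(sorted(TRANS[i]) for i in states)))
values = {pi: average_reward(pi) for pi in policies}
best = max(values.values())
optimal = [pi for pi in policies if values[pi] == best]

# Actions used by at least one optimal policy, per state, and all their combinations.
opt_actions = [sorted({pi[i] for pi in optimal}) for i in states]
combos = list(product(*opt_actions))
closed = all(values[pi] == best for pi in combos)
```

Such a check is of course no substitute for the proof; it only illustrates what "closed under combination" means on one instance.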

2. Proof of Theorem 1.1

We start with a result about the distributions of policies that differ in at most two 
states. 

Lemma 2.1. Let $\mathcal{M}$ be a unichain MDP with state space $S$. Let $\pi_{00}, \pi_{01}, \pi_{10}, \pi_{11}$ be four
policies on $\mathcal{M}$ with invariant distributions $\mu_{00} = (a_i)_{i \in S}$, $\mu_{01} = (b_i)_{i \in S}$, $\mu_{10} = (c_i)_{i \in S}$,
$\mu_{11} = (d_i)_{i \in S}$ and $s_1, s_2$ two states in $S$ such that

(i) for all $i \in S \setminus \{s_1, s_2\}$: $\pi_{00}(i) = \pi_{01}(i) = \pi_{10}(i) = \pi_{11}(i)$,
(ii) $\pi_{00}(s_1) = \pi_{01}(s_1) \ne \pi_{10}(s_1) = \pi_{11}(s_1)$,
(iii) $\pi_{00}(s_2) = \pi_{10}(s_2) \ne \pi_{01}(s_2) = \pi_{11}(s_2)$.

Then each of the distributions $\mu_{ij}$ is uniquely determined by the other three. More
precisely, e.g.² for all states $i$

$$d_i = \frac{a_{s_2} b_{s_1} c_i - a_i b_{s_1} c_{s_2} + a_{s_1} b_i c_{s_2}}{a_{s_2} b_{s_1} - b_{s_1} c_{s_2} + a_{s_1} c_{s_2}}.$$

² For the sake of readability we don't give a general formula but only reproduce how to calculate $\mu_{11}$.
Since the situation is symmetric it is easy to see (but a bit tedious to write down or read) what the general
formula looks like.

Proof. Since $\mathcal{M}$ is unichain, the distributions $\mu_{ij}$ are all uniquely determined by the
transition matrices $P_{ij}$ of the Markov chains induced by the policies $\pi_{ij}$. By assumption
(i), the matrices $P_{ij}$ share all rows except rows $s_1, s_2$, which we may assume to be the
first and second row, respectively. Furthermore, by (ii), $P_{00}$ and $P_{01}$ share the first row
as well as $P_{10}$ and $P_{11}$. Finally, by (iii) we have equal second rows in $P_{00}$ and $P_{10}$ as well
as in $P_{01}$ and $P_{11}$.

Since by assumption the distributions $\mu_{ij}$ are invariant, we have $\mu_{ij} P_{ij} = \mu_{ij}$. Writing the
probabilities in $P_{00}$ as $p_{ij}$ and those in the first two rows of $P_{11}$ as $q_{ij}$, it follows that for
each state $i$:

$$a_i = \sum_{j \in S} a_j p_{ji}, \qquad b_i = b_2 q_{2i} + \sum_{j \in S \setminus \{2\}} b_j p_{ji}, \qquad c_i = c_1 q_{1i} + \sum_{j \in S \setminus \{1\}} c_j p_{ji}. \tag{1}$$

Setting $\nu = (\nu_i)_{i \in S}$ with $\nu_i := a_2 b_1 c_i - a_i b_1 c_2 + a_1 b_i c_2$, one has by (1)

$$\begin{aligned}
(\nu P_{11})_i &= a_2 b_1 c_1 q_{1i} + a_1 b_2 c_2 q_{2i} + \sum_{j \in S \setminus \{1,2\}} a_2 b_1 c_j p_{ji} - \sum_{j \in S \setminus \{1,2\}} a_j b_1 c_2 p_{ji} + \sum_{j \in S \setminus \{1,2\}} a_1 b_j c_2 p_{ji} \\
&= a_2 b_1 c_1 q_{1i} + a_1 b_2 c_2 q_{2i} + a_2 b_1 \big(c_i - c_1 q_{1i} - c_2 p_{2i}\big) \\
&\qquad - b_1 c_2 \big(a_i - a_1 p_{1i} - a_2 p_{2i}\big) + a_1 c_2 \big(b_i - b_1 p_{1i} - b_2 q_{2i}\big) \\
&= a_2 b_1 c_i - a_i b_1 c_2 + a_1 b_i c_2 = \nu_i.
\end{aligned}$$

Hence, normalizing $\nu$ one has an invariant distribution of $P_{11}$, which by assumption is
unique and consequently identical to $\mu_{11}$. □
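The formula of Lemma 2.1 can be sanity-checked numerically. The following sketch assumes a hypothetical three-state MDP with two actions in each of the first two states (playing the roles of $s_1$ and $s_2$) and a single action in the third; all rows are invented for illustration. It compares $\mu_{11}$ computed directly with the value obtained from $\mu_{00}, \mu_{01}, \mu_{10}$.

```python
from fractions import Fraction as F


def stationary(P):
    """Invariant distribution of an irreducible chain (exact Gauss-Jordan)."""
    n = len(P)
    A = [[P[j][i] - (1 if i == j else 0) for j in range(n)] + [F(0)]
         for i in range(n - 1)]
    A.append([F(1)] * (n + 1))
    for col in range(n):
        piv = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[piv] = A[piv], A[col]
        A[col] = [x / A[col][col] for x in A[col]]
        for r in range(n):
            if r != col:
                factor = A[r][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]


# Hypothetical rows: index 0 is the action of pi_00, index 1 the deviating action.
rows_s1 = {0: [F(1, 2), F(1, 4), F(1, 4)], 1: [F(1, 4), F(1, 4), F(1, 2)]}
rows_s2 = {0: [F(1, 3), F(1, 3), F(1, 3)], 1: [F(1, 2), F(1, 4), F(1, 4)]}
row_s3 = [F(1, 4), F(1, 2), F(1, 4)]  # shared by all four policies


def mu(x, y):
    """Invariant distribution of pi_xy (action x in s1, action y in s2)."""
    return stationary([rows_s1[x], rows_s2[y], row_s3])


a, b, c, d = mu(0, 0), mu(0, 1), mu(1, 0), mu(1, 1)

# nu_i = a_2 b_1 c_i - a_i b_1 c_2 + a_1 b_i c_2 (states s1, s2 are indices 0, 1 here).
nu = [a[1] * b[0] * c[i] - a[i] * b[0] * c[1] + a[0] * b[i] * c[1] for i in range(3)]
alpha = sum(nu)
d_formula = [v / alpha for v in nu]
```

Since each of $a$, $b$, $c$ sums to one, the normalizing constant `alpha` equals $a_2 b_1 - b_1 c_2 + a_1 c_2$, the denominator in the lemma.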

With this information on the distributions, we are able to tell something about the 
average rewards of the policies as well. 

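Lemma 2.2. Let $\pi_{00}, \pi_{01}, \pi_{10}, \pi_{11}$ and $s_1, s_2$ be as in Lemma 2.1 and denote the average
rewards of the policies by $V_{00}, V_{01}, V_{10}, V_{11}$. Let $a, b \in \{0, 1\}$ and set $\neg x := 1 - x$. Then it
cannot be the case that $V_{a,b} > V_{\neg a,b}, V_{a,\neg b}$ and $V_{\neg a,\neg b} \ge V_{\neg a,b}, V_{a,\neg b}$. Analogously, it cannot
hold that $V_{a,b} < V_{\neg a,b}, V_{a,\neg b}$ and $V_{\neg a,\neg b} \le V_{\neg a,b}, V_{a,\neg b}$.

Proof. For the sake of readability, we prove the case $a = b = 0$. The other cases follow
by symmetry. Actually, we will show that if $V_{00} > V_{01}, V_{10}$, then $V_{00} > V_{11}$. Since one
has analogously the implication that if $V_{11} \ge V_{01}, V_{10}$, then $V_{11} \ge V_{00}$, the assumptions
$V_{00} > V_{01}, V_{10}$ and $V_{11} \ge V_{01}, V_{10}$ obviously lead to a contradiction.

Similarly as in the case of transition probabilities, in the following we write for the
rewards of the policy $\pi_{00}$ simply $r_i$ instead of $r_{\pi_{00}(i)}(i)$. For the deviating rewards in state
$s_1$ under policies $\pi_{10}, \pi_{11}$ and state $s_2$ under $\pi_{01}, \pi_{11}$ we write $r_1'$ and $r_2'$, respectively. Then
we have

$$\begin{aligned}
V_{00} &= \sum_{i \in S} a_i r_i, & V_{01} &= b_2 r_2' + \sum_{i \in S \setminus \{2\}} b_i r_i, \\
V_{10} &= c_1 r_1' + \sum_{i \in S \setminus \{1\}} c_i r_i, & V_{11} &= d_1 r_1' + d_2 r_2' + \sum_{i \in S \setminus \{1,2\}} d_i r_i.
\end{aligned}$$

If we now assume that $V_{00} > V_{01}, V_{10}$, the first three equations yield

$$b_2 r_2' < a_2 r_2 + \sum_{i \in S \setminus \{2\}} (a_i - b_i) r_i, \tag{2}$$

$$c_1 r_1' < a_1 r_1 + \sum_{i \in S \setminus \{1\}} (a_i - c_i) r_i, \tag{3}$$

while applying Lemma 2.1 to the fourth equation gives

$$V_{11} = \frac{1}{\alpha} \Big( a_2 b_1 c_1 r_1' + a_1 b_2 c_2 r_2' + \sum_{i \in S \setminus \{1,2\}} (a_2 b_1 c_i - a_i b_1 c_2 + a_1 c_2 b_i) r_i \Big),$$

where $\alpha = a_2 b_1 - b_1 c_2 + a_1 c_2$. Substituting according to (2) and (3) then yields

$$\begin{aligned}
V_{11} &< \frac{a_2 b_1}{\alpha} \Big( a_1 r_1 + \sum_{i \in S \setminus \{1\}} (a_i - c_i) r_i \Big) + \frac{a_1 c_2}{\alpha} \Big( a_2 r_2 + \sum_{i \in S \setminus \{2\}} (a_i - b_i) r_i \Big) \\
&\qquad + \frac{a_2 b_1}{\alpha} \sum_{i \in S \setminus \{1,2\}} c_i r_i - \frac{b_1 c_2}{\alpha} \sum_{i \in S \setminus \{1,2\}} a_i r_i + \frac{a_1 c_2}{\alpha} \sum_{i \in S \setminus \{1,2\}} b_i r_i \\
&= \frac{1}{\alpha} \Big( a_1 a_2 b_1 r_1 + a_2 b_1 (a_2 - c_2) r_2 + a_1 a_2 c_2 r_2 + a_1 c_2 (a_1 - b_1) r_1 \\
&\qquad + (a_2 b_1 + a_1 c_2 - b_1 c_2) \sum_{i \in S \setminus \{1,2\}} a_i r_i \Big) \\
&= \frac{a_2 b_1 - b_1 c_2 + a_1 c_2}{\alpha} \Big( a_1 r_1 + a_2 r_2 + \sum_{i \in S \setminus \{1,2\}} a_i r_i \Big) = V_{00}.
\end{aligned}$$

Obviously, replacing '$>$' with '$\ge$', '$<$' or '$\le$' throughout the proof yields the analogous
result for the other cases, which finishes the proof. □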
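The configurations excluded by Lemma 2.2 can be searched for mechanically. The sketch below assumes a hypothetical three-state MDP (rows and rewards are invented for illustration): it computes $V_{00}, V_{01}, V_{10}, V_{11}$ and lists every pair $(a, b)$ for which one of the two forbidden patterns occurs. If the lemma applies, the list stays empty.

```python
from fractions import Fraction as F


def stationary(P):
    """Invariant distribution of an irreducible chain (exact Gauss-Jordan)."""
    n = len(P)
    A = [[P[j][i] - (1 if i == j else 0) for j in range(n)] + [F(0)]
         for i in range(n - 1)]
    A.append([F(1)] * (n + 1))
    for col in range(n):
        piv = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[piv] = A[piv], A[col]
        A[col] = [x / A[col][col] for x in A[col]]
        for r in range(n):
            if r != col:
                factor = A[r][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]


# Hypothetical data: two actions (0/1) in s1 and s2, one shared action in s3.
rows_s1 = {0: [F(1, 2), F(1, 4), F(1, 4)], 1: [F(1, 4), F(1, 4), F(1, 2)]}
rows_s2 = {0: [F(1, 3), F(1, 3), F(1, 3)], 1: [F(1, 2), F(1, 4), F(1, 4)]}
row_s3 = [F(1, 4), F(1, 2), F(1, 4)]
rew_s1 = {0: F(1), 1: F(3)}
rew_s2 = {0: F(2), 1: F(0)}
rew_s3 = F(1)


def V(x, y):
    """Average reward of pi_xy (action x in s1, action y in s2)."""
    mu = stationary([rows_s1[x], rows_s2[y], row_s3])
    return mu[0] * rew_s1[x] + mu[1] * rew_s2[y] + mu[2] * rew_s3


def forbidden(x, y):
    """True if (x, y) exhibits one of the two patterns excluded by Lemma 2.2."""
    me, opposite = V(x, y), V(1 - x, 1 - y)
    neighbours = [V(1 - x, y), V(x, 1 - y)]
    up = all(me > v for v in neighbours) and all(opposite >= v for v in neighbours)
    down = all(me < v for v in neighbours) and all(opposite <= v for v in neighbours)
    return up or down


violations = [(x, y) for x in (0, 1) for y in (0, 1) if forbidden(x, y)]
```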

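The following is a collection of simple consequences of Lemma 2.2.

Corollary 2.1. Let $V_{00}, V_{01}, V_{10}, V_{11}$ and $a, b$ be as in Lemma 2.2. Then the following
implications hold:

(i) $V_{a,b} < V_{a,\neg b}, V_{\neg a,b} \implies V_{\neg a,\neg b} > \min(V_{a,\neg b}, V_{\neg a,b})$.
(ii) $V_{a,b} > V_{a,\neg b}, V_{\neg a,b} \implies V_{\neg a,\neg b} < \max(V_{a,\neg b}, V_{\neg a,b})$.
(iii) $V_{a,b} \le V_{a,\neg b}, V_{\neg a,b} \implies V_{\neg a,\neg b} \ge \min(V_{a,\neg b}, V_{\neg a,b})$.
(iv) $V_{a,b} \ge V_{a,\neg b}, V_{\neg a,b} \implies V_{\neg a,\neg b} \le \max(V_{a,\neg b}, V_{\neg a,b})$.
(v) $V_{a,b} = V_{a,\neg b} = V_{\neg a,b} \implies V_{\neg a,\neg b} = V_{a,b}$.
(vi) $V_{a,b}, V_{\neg a,\neg b} \ge V_{a,\neg b}, V_{\neg a,b} \implies V_{a,b} = V_{a,\neg b} = V_{\neg a,b} = V_{\neg a,\neg b}$.

Proof. (i), (ii), (iii), (iv) are mere reformulations of Lemma 2.2, while (vi) is an easy
consequence. Thus let us consider (v). If $V_{\neg a,\neg b}$ were $< V_{a,\neg b} = V_{\neg a,b}$, then by Lemma 2.2,
$V_{a,b} > \min(V_{a,\neg b}, V_{\neg a,b})$, contradicting our assumption. Since a similar contradiction crops
up if we assume that $V_{\neg a,\neg b} > V_{a,\neg b} = V_{\neg a,b}$, it follows that $V_{\neg a,\neg b} = V_{a,\neg b} = V_{\neg a,b} = V_{a,b}$. □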

Now, in order to prove the theorem, we ignore all states where the optimal policies
$\pi_1^\circ, \pi_2^\circ$ coincide. For the remaining $s$ states we denote the actions of $\pi_1^\circ$ by 0 and those of
$\pi_2^\circ$ by 1. Thus any combination of $\pi_1^\circ, \pi_2^\circ$ can be expressed as a sequence of $s$ elements
$\in \{0, 1\}$, where we assume an arbitrary order on the set of states (take e.g. the one
used in the matrices $P_{ij}$). We now define sets of policies or sequences, respectively, as
follows: First, let $\Theta_i$ be the set of policies with exactly $i$ occurrences of 1. Then set
$\Pi_0 := \Theta_0 = \{00 \ldots 0\}$, and for $1 \le i \le s$

$$\Pi_i := \{\pi \in \Theta_i \mid d(\pi, \pi^*_{i-1}) = 1\},$$

where $d$ denotes the Hamming distance, and $\pi^*_i$ is a (fixed) policy in $\Pi_i$ with $V(\pi^*_i) =
\max_{\pi \in \Pi_i} V(\pi)$. Thus, a policy is $\in \Pi_i$ if and only if it can be obtained from $\pi^*_{i-1}$ by
replacing a 0 with a 1.
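The construction of the sets $\Pi_i$ and the policies $\pi^*_i$ can be sketched in a few lines. In the sketch below, policies are tuples of 0/1 digits as in the text; the function names `successors` and `chain` are our own, and the `value` function passed in the example is a made-up stand-in for $V$, not the average reward of an actual MDP.

```python
def successors(star):
    """Pi_i: all 0/1 sequences obtained from pi*_{i-1} by replacing one 0 with a 1,
    i.e. the sequences with one more 1 at Hamming distance 1 from pi*_{i-1}."""
    return [star[:k] + (1,) + star[k + 1:] for k, bit in enumerate(star) if bit == 0]


def chain(value, s):
    """Return [pi*_0, ..., pi*_s], where pi*_0 = 00...0 and pi*_i maximizes
    `value` over Pi_i (ties are broken by taking the first maximizer)."""
    star = (0,) * s
    stars = [star]
    for _ in range(s):
        star = max(successors(star), key=value)
        stars.append(star)
    return stars


# Toy stand-in for V on three differing states (purely illustrative).
toy_value = lambda pi: pi[2] - pi[0]
stars = chain(toy_value, 3)
```

By construction, consecutive elements of the returned list differ in exactly one position, and the last element is always $11 \ldots 1$.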



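Lemma 2.3. $V(\pi^*_{i-1}) \ge V(\pi^*_i)$ for $1 \le i \le s$.

Proof. The lemma obviously holds for $i = 1$, since $\pi^*_0 = 00 \ldots 0 = \pi_1^\circ$ is by assumption
optimal. Proceeding by induction, let $i > 1$ and assume that $V(\pi^*_{i-2}) \ge V(\pi^*_{i-1})$. By
construction of the elements in each $\Pi_j$ the policies $\pi^*_{i-2}$, $\pi^*_{i-1}$ and $\pi^*_i$ differ in at most
two states, i.e. the situation is as follows:

$$\begin{array}{rcl}
\pi^*_{i-2} &=& \ldots 0 \ldots 0 \ldots \\
\pi^*_{i-1} &=& \ldots 1 \ldots 0 \ldots \\
\pi^*_{i} &=& \ldots 1 \ldots 1 \ldots \\
\pi' &=& \ldots 0 \ldots 1 \ldots
\end{array}$$

Define a policy $\pi' \in \Pi_{i-1}$ as indicated above. Then $V(\pi^*_{i-2}) \ge V(\pi^*_{i-1}) \ge V(\pi')$ by
induction assumption and optimality of $\pi^*_{i-1}$ in $\Pi_{i-1}$. Applying (iv) of Corollary 2.1
yields that $V(\pi^*_i) \le \max(V(\pi^*_{i-1}), V(\pi')) = V(\pi^*_{i-1})$, which proves the lemma. □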



Since the policies $\pi^*_0 = 00 \ldots 0 = \pi_1^\circ$ and $\pi^*_s = 11 \ldots 1 = \pi_2^\circ$ are assumed to be optimal,
it follows that all policies $\pi^*_i$ are optimal as well. Now we are able to prove the Theorem
by induction on the number of states $s$ where the policies $\pi_1^\circ, \pi_2^\circ$ differ. For $s = 1$ it is
trivial, while for $s = 2$ there are two combinations of $\pi_1^\circ = \pi^*_0$ and $\pi_2^\circ = \pi^*_2$. One of them
is identical to $\pi^*_1$ and hence optimal, while the other one is optimal due to Corollary 2.1.

Thus, let us assume that $s > 2$. Then we have already shown that the policies $\pi^*_i$
and hence in particular $\pi^*_1 = 00 \ldots 010 \ldots 0$ and $\pi^*_{s-1} = 11 \ldots 101 \ldots 1$ are optimal. Since
$\pi^*_1$ and $\pi^*_s = 11 \ldots 1 = \pi_2^\circ$ are optimal policies that share a common digit in position
$k$, we may conclude by induction assumption that all policies with a 1 in position $k$ are
optimal. A similar argument applied to the policies $\pi^*_0 = 00 \ldots 0$ and $\pi^*_{s-1}$ shows that all
policies with a 0 in position $\ell$ (the position of the 0 in $\pi^*_{s-1}$) are optimal. Note that by
construction of the sets $\Pi_i$, $k \ne \ell$. Thus, we have shown that all considered policies are
optimal, except those with a 1 in position $\ell$ and a 0 in position $k$. However, as all policies
of the form

$$\begin{array}{c}
\phantom{\ldots}k\phantom{\ldots}\ell\phantom{\ldots} \\
\ldots 0 \ldots 0 \ldots \\
\ldots 1 \ldots 0 \ldots \\
\ldots 1 \ldots 1 \ldots
\end{array}$$

are optimal, a final application of Corollary 2.1 (v) shows that these are optimal as
well. □

3. Mixing Optimal Policies 

Theorem 1.1 can be extended to mixing optimal policies, that is, our policies are not
deterministic (pure) anymore, but in each state we choose an action randomly. Building
on Theorem 1.1 we can show that any mixture of optimal policies is optimal as well.

Theorem 3.1. Let $\Pi^*$ be a set of pure optimal policies on a unichain MDP $\mathcal{M}$. Then
any policy that chooses at each visit in each state $i$ randomly an action $a$ such that there
is a policy $\pi \in \Pi^*$ with $a = \pi(i)$, is optimal.

The theorem will be obtained with the help of the following Lemma. 

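Lemma 3.1. Let $\pi_1, \pi_2$ be two policies on a unichain MDP $\mathcal{M}$ that differ only in a single
state $s_1$, i.e. $\pi_1(i) = \pi_2(i)$ for all $i \ne s_1$ and $\pi_1(s_1) \ne \pi_2(s_1)$. Let $\mu_1 = (a_i)_{i \in S}$, $\mu_2 = (b_i)_{i \in S}$
be the invariant distributions and $V_1, V_2$ the average rewards of $\pi_1$ and $\pi_2$, respectively.
Then the mixed policy $\pi$ that chooses in $s_1$ action $\pi_1(s_1)$ with probability $\lambda$ and $\pi_2(s_1)$
with probability $(1 - \lambda)$ and coincides with $\pi_j$ in all other states has invariant distribution
$\mu = (c_i)_{i \in S}$ with

$$c_i = \frac{\lambda a_i b_{s_1} + (1 - \lambda) a_{s_1} b_i}{\lambda b_{s_1} + (1 - \lambda) a_{s_1}}$$

and average reward

$$V = \frac{\lambda b_{s_1} V_1 + (1 - \lambda) a_{s_1} V_2}{\lambda b_{s_1} + (1 - \lambda) a_{s_1}}.$$

Proof. First, note that the transition matrices $P_1, P_2$ of $\pi_1, \pi_2$ and $P$ of $\pi$ share all rows
except row $s_1$, which we assume to be the first row. Furthermore we write $p_{ij}$ for $p_{\pi_1(i)}(i, j)$
and $q_{1i}$ for $p_{\pi_2(s_1)}(s_1, i)$, so that the entries of row 1 in $P$ are of the form $\lambda p_{1i} + (1 - \lambda) q_{1i}$.
Now, let $\nu := (\nu_i)_{i \in S}$ with $\nu_i = \lambda a_i b_1 + (1 - \lambda) a_1 b_i$. Then

$$\begin{aligned}
(\nu P)_i &= a_1 b_1 \big(\lambda p_{1i} + (1 - \lambda) q_{1i}\big) + \sum_{j \in S \setminus \{1\}} \big(\lambda a_j b_1 + (1 - \lambda) a_1 b_j\big) p_{ji} \\
&= \lambda b_1 \Big(a_1 p_{1i} + \sum_{j \in S \setminus \{1\}} a_j p_{ji}\Big) + (1 - \lambda) a_1 \Big(b_1 q_{1i} + \sum_{j \in S \setminus \{1\}} b_j p_{ji}\Big) \\
&= \lambda a_i b_1 + (1 - \lambda) a_1 b_i = \nu_i,
\end{aligned}$$

since the $a_i$'s and $b_i$'s form an invariant distribution of $P_1$, $P_2$, respectively. Since the $c_i$'s
are only a normalized version of the $\nu_i$'s, this finishes the first part of the proof.

Now, given the invariant distribution of $\pi$, its average reward can be written as $V =
\sum_{i \in S} c_i r_{\pi(i)}(i)$. Thus, writing $r_i$ for $r_{\pi_1(i)}(i)$ and $r_1'$ for $r_{\pi_2(s_1)}(s_1)$ one has

$$\begin{aligned}
V &= \frac{a_1 b_1}{\lambda b_1 + (1 - \lambda) a_1} \big(\lambda r_1 + (1 - \lambda) r_1'\big) + \sum_{i \in S \setminus \{1\}} \frac{\lambda a_i b_1 + (1 - \lambda) a_1 b_i}{\lambda b_1 + (1 - \lambda) a_1}\, r_i \\
&= \frac{1}{\lambda b_1 + (1 - \lambda) a_1} \Big(\lambda b_1 \sum_{i \in S} a_i r_i + (1 - \lambda) a_1 \Big(b_1 r_1' + \sum_{i \in S \setminus \{1\}} b_i r_i\Big)\Big) \\
&= \frac{\lambda b_1 V_1 + (1 - \lambda) a_1 V_2}{\lambda b_1 + (1 - \lambda) a_1}.
\end{aligned}$$

□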
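Both formulas of Lemma 3.1 can be checked on a small instance. The sketch below assumes a hypothetical two-state MDP in which two policies differ only in the first state, with mixing weight $\lambda = 1/2$; all rows and rewards are invented for illustration. It compares the closed-form expressions with a direct computation on the mixed chain.

```python
from fractions import Fraction as F


def stationary(P):
    """Invariant distribution of an irreducible chain (exact Gauss-Jordan)."""
    n = len(P)
    A = [[P[j][i] - (1 if i == j else 0) for j in range(n)] + [F(0)]
         for i in range(n - 1)]
    A.append([F(1)] * (n + 1))
    for col in range(n):
        piv = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[piv] = A[piv], A[col]
        A[col] = [x / A[col][col] for x in A[col]]
        for r in range(n):
            if r != col:
                factor = A[r][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]


lam = F(1, 2)                        # probability of choosing pi_1(s1) in s1
p_row = [F(1, 2), F(1, 2)]           # row of s1 under pi_1 (hypothetical)
q_row = [F(1, 4), F(3, 4)]           # row of s1 under pi_2 (hypothetical)
shared = [F(1, 2), F(1, 2)]          # row of the second state, same for both
r1, r1_alt, r2 = F(1), F(0), F(2)    # rewards: s1 under pi_1, s1 under pi_2, s2

a = stationary([p_row, shared])      # mu_1
b = stationary([q_row, shared])      # mu_2
V1 = a[0] * r1 + a[1] * r2
V2 = b[0] * r1_alt + b[1] * r2

# Closed-form expressions from Lemma 3.1 (s1 is index 0 here).
den = lam * b[0] + (1 - lam) * a[0]
c_formula = [(lam * a[i] * b[0] + (1 - lam) * a[0] * b[i]) / den for i in range(2)]
V_formula = (lam * b[0] * V1 + (1 - lam) * a[0] * V2) / den

# Direct computation on the chain induced by the mixed policy.
mixed_row = [lam * p + (1 - lam) * q for p, q in zip(p_row, q_row)]
c_direct = stationary([mixed_row, shared])
V_direct = c_direct[0] * (lam * r1 + (1 - lam) * r1_alt) + c_direct[1] * r2
```

Note that $V$ is a weighted average of $V_1$ and $V_2$ with positive weights, so $V_1 = V_2$ implies $V = V_1$, which is what the proof of Theorem 3.1 uses.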



Proof of Theorem 3.1. Let us first assume the simplest case, where we have two optimal
policies $\pi_1^\circ, \pi_2^\circ$ that differ only in some single state $s_1$. By Lemma 3.1 the policy resulting
from $\pi_1^\circ$ and $\pi_2^\circ$ when mixing actions in state $s_1$ has the same average reward as $\pi_1^\circ$ and $\pi_2^\circ$
and therefore is optimal. Now, each mixture of actions in a single state can be interpreted
as a new action in this state. Thus, proceeding by induction, the mixture of $n$ optimal
policies $\pi_1^\circ, \ldots, \pi_n^\circ$ that differ only in a single state is optimal as well.

Now, in the general case, where we want to mix actions in $s > 1$ states, we have at each
state $i$ the actions $a_i^1, \ldots, a_i^{k_i}$ (of some pure optimal policies $\in \Pi^*$) at our disposal. By
Theorem 1.1 all combinations $(a_1^{j_1}, a_2^{j_2}, \ldots, a_s^{j_s})$ with $1 \le j_i \le k_i$ for all $i$ are optimal as
well. Thus, we may fix the actions in $s - 1$ states so that we have e.g. optimal policies
of the form $(a_1^j, a_2^1, \ldots, a_s^1)$ with $1 \le j \le k_1$. As we have seen above, all policies that
are obtained by mixing all available actions in the first state are optimal. Furthermore,
each mixture can again be interpreted as a new available action, so that we may repeat our
argument for each of the remaining states, thus showing that each mixed optimal policy
is optimal, too.

So far, we have considered only the case where the relative frequencies with which the
actions in a fixed state are chosen converge. If this does not hold, it may happen that the
process does not converge to an invariant distribution. However, the average rewards after
$t$ steps converge nevertheless. Let $\lambda_i^t(a)$ be the relative frequency with which action $a$ was
chosen in state $i$ after $t$ steps in $i$, and let $\mu_t$ be the distribution over the states after these
$t$ steps. Then the average reward $V_t$ thereby obtained is $\sum_{i \in S} \mu_t(i) \sum_{a \in A} \lambda_i^t(a)\, r_a(i)$. This is
of course also the expected average reward after $t$ steps when constantly choosing action
$a$ in state $i$ with probability $\lambda_i(a) := \lambda_i^t(a)$ for each $i, a$. As each of these sequences has
already been shown to converge to the optimal value $V^*$, we have the following situation.
For each $V_{t_0}$ of the sequence $(V_t)_{t \in \mathbb{N}}$ there is a sequence $(V_t(\pi))_{t \in \mathbb{N}}$ with $\lim_{t \to \infty} V_t(\pi) = V^*$
such that $V_{t_0}(\pi) = V_{t_0}$. It follows that $\lim_{t \to \infty} V_t = V^*$. □



4. Extensions, Applications and Remarks 



4.1. Optimality is Necessary. Given some policies with equal average reward $V$, in
general, it is not the case that a combination of these policies again has average reward
$V$, as the following example shows. Thus, optimality is a necessary condition in Theorem 1.1.

Example 4.1. Let $S = \{s_1, s_2\}$ and $A = \{a_1, a_2\}$. The transition probabilities are given
by

[transition probability matrices missing in the source]

while the rewards are $r_{a_1}(s_1) = r_{a_1}(s_2) = 0$ and $r_{a_2}(s_1) = r_{a_2}(s_2) = 1$. Since the transition
probabilities of all policies are identical, policy $(a_2, a_2)$ with an average reward of 1 is
obviously optimal. Policy $(a_2, a_2)$ can be obtained as a combination of the policies $(a_1, a_2)$
and $(a_2, a_1)$, which however only yield an average reward of $\frac{1}{2}$.

4.2. Multichain and Infinite MDPs. Theorem 1.1 does not hold for MDPs that are
not unichain, as the following simple example demonstrates.

Example 4.2. Let $S = \{s_1, s_2\}$ and $A = \{a_1, a_2\}$. The transition probabilities are given
by

[transition probability matrices missing in the source]

while the rewards are $r_{a_1}(s_1) = r_{a_1}(s_2) = 1$ and $r_{a_2}(s_1) = r_{a_2}(s_2) = 0$. Then the policies
$(a_1, a_1)$, $(a_1, a_2)$, $(a_2, a_1)$ all gain an average reward of 1 and are optimal, while the
combined policy $(a_2, a_2)$ yields suboptimal average reward 0.

Even though this seems to be quite a strict counterexample (note that the MDP is even
communicating), we think that in certain restricted settings Theorems 1.1 and 3.1 will
hold as well. For example, adding a set of states that are transient under every policy does
not matter. Furthermore, if the components of a multichain MDP are the same under
every policy, it is obvious that the Theorems hold as well. However, things become more
complicated if the set of transient states or the components change with the policy as in
Example 4.2. Nevertheless, extensions of our results to the multichain case don't seem to
be impossible as such, but may work under some clever restrictions, e.g. by combining
exclusively in states that are not transient under any policy. In any case the main task
when working on such extensions will probably be to determine what policy changes will
result in what changes in the set of transient states and components, respectively.

The situation for MDPs with a countable set of states/actions is similar. Under the
(strong) assumption that there exists a unique invariant and positive distribution for
each policy, Theorems 1.1 and 3.1 also hold for these MDPs. In this case the proofs are
identical to the case of finite MDPs (with the only difference that the induction becomes
transfinite). However, in general, countable MDPs are much harder to handle as optimal
policies need not be stationary anymore (cf. [2], p. 413f).

4.3. An Application. Even though the presented results may seem more of theoretical
interest, there is a straightforward application of Theorem 3.1 which actually was the
starting point of this paper. Consider an algorithm operating on an MDP that every
now and then recalculates the optimal policy according to its estimates of the transition
probabilities and the rewards, respectively. Sooner or later the estimates are good enough
so that the calculated policy is indeed an optimal one. However, if there is more than one
optimal policy, it may happen that the algorithm does not stick to a single optimal policy
but starts mixing optimal policies irregularly. Theorem 3.1 guarantees that the average
reward of such a process is still optimal.
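The irregular switching described above can be illustrated by an exact computation of the expected average reward over a finite horizon. The sketch below assumes a hypothetical two-state MDP in which all four pure policies built from actions "a" and "b" have average reward 1 (the numbers are invented for illustration), and a made-up switching rule `picks_b` that changes the chosen action in blocks of doubling length, so that the relative action frequencies do not converge.

```python
from fractions import Fraction as F

# Hypothetical MDP: TRANS[i][act] is the transition row, REW[i][act] the mean reward.
TRANS = {0: {"a": [F(1, 2), F(1, 2)], "b": [F(1, 4), F(3, 4)]},
         1: {"a": [F(1, 2), F(1, 2)], "b": [F(3, 4), F(1, 4)]}}
REW = {0: {"a": F(1, 2), "b": F(1, 4)},
       1: {"a": F(3, 2), "b": F(7, 4)}}


def picks_b(t, i):
    """Made-up irregular rule: at step t, state i uses action "b" during every
    other block of doubling length (blocks are offset between the two states)."""
    return ((t + 1).bit_length() + i) % 2 == 0


def expected_average_reward(T, start=0):
    """Expected average reward over the first T steps when starting in `start`,
    obtained by propagating the exact state distribution step by step."""
    dist = [F(0), F(0)]
    dist[start] = F(1)
    total = F(0)
    for t in range(T):
        nxt = [F(0), F(0)]
        for i in (0, 1):
            act = "b" if picks_b(t, i) else "a"
            total += dist[i] * REW[i][act]
            for j in (0, 1):
                nxt[j] += dist[i] * TRANS[i][act][j]
        dist = nxt
    return total / T


T = 300
V_T = expected_average_reward(T)
```

In this particular example the finite-horizon value stays within $1/T$ of the optimal value 1; the sketch only covers the expected reward of one fixed switching schedule, not the general statement of Theorem 3.1.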

5. Conclusion 

We conclude with a more philosophical remark. MDPs are usually presented as a 
standard example for decision processes with delayed feedback. That is, an optimal policy 
often has to accept locally small rewards in present states in order to gain large rewards 
later in future states. One may think that this induces some sort of context in which 
actions are optimal, e.g. that choosing a locally suboptimal action only "makes sense" in 
the context of heading to the higher reward states. Our results however show that this is 
not the case and optimal actions are rather optimal in any context. 